The Winnow Algorithm

We now turn to the Winnow Algorithm, developed by Nick Littlestone, which performs
especially well when many of the features given to the learner turn out to be
irrelevant. Like the Perceptron Training Procedure discussed in the previous lectures,
Winnow uses linear threshold functions as hypotheses and performs incremental updates to
its current hypothesis. Unlike the Perceptron Training Procedure, Winnow uses multiplicative
rather than additive updates. Winnow was developed with the goal of providing a significantly
better mistake bound when $r \ll n$, where $r$ is the number of relevant features and $n$ is
the total number of features.

We will analyze the Winnow algorithm for learning the class $C$ of monotone disjunctions
of $r$ variables. Note that Winnow, or generalizations of it, can handle other specific
concept classes (e.g., non-monotone disjunctions, majority functions, linear separators), but
the analysis is simplest in the case of monotone disjunctions.

The Algorithm 

Both the Winnow and Perceptron algorithms use the same classification scheme:

• $w \cdot x \geq \theta \Rightarrow$ positive classification

• $w \cdot x < \theta \Rightarrow$ negative classification

For convenience, we assume that $\theta = n$ and we initialize $w$ to the “all ones” vector, i.e.,
$w_i = 1$ for all $i$.[1]

The Winnow Algorithm differs from the Perceptron Algorithm in its update scheme. When
misclassifying a positive training example $x$ (i.e., the prediction was negative because
$w \cdot x$ was too small):

$\forall x_i = 1: \; w_i \leftarrow 2 w_i$

When misclassifying a negative training example $x$ (i.e., the prediction was positive because
$w \cdot x$ was too large):

$\forall x_i = 1: \; w_i \leftarrow w_i / 2$.

[1] Note that Perceptron used a threshold of 0, but here we use a threshold of $n$.


Notice that because we are updating multiplicatively, all weights remain positive; so, to
handle a non-monotone target concept we would need to transform the input space so that
each negated literal $\bar{x}_i$ becomes a new variable. Intuitively, Winnow does more with the
training information than the traditional list-and-cross-off scheme, because now in order to
predict positive there must be “enough” evidence; thus we make more progress when we make a
mistake. As we will see later, this is why Winnow guarantees a better performance bound for
learning disjunctions when $r$ is small.
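The prediction rule and the two multiplicative updates described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes 0/1 features and labels, takes the threshold to be $n$ with ties predicted positive, and the names `predict`, `winnow_update`, `run_winnow`, and `with_negations` are hypothetical.

```python
import itertools
import math
from typing import Iterable, List, Sequence, Tuple


def predict(w: Sequence[float], x: Sequence[int], theta: float) -> int:
    """Return 1 (positive) if w . x >= theta, otherwise 0 (negative)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= theta else 0


def winnow_update(w: List[float], x: Sequence[int], label: int) -> None:
    """After a mistake on (x, label), rescale the weights of active features.

    Missed positive: double every w_i with x_i = 1.
    Missed negative: halve every w_i with x_i = 1.
    """
    factor = 2.0 if label == 1 else 0.5
    for i, xi in enumerate(x):
        if xi == 1:
            w[i] *= factor


def run_winnow(examples: Iterable[Tuple[Sequence[int], int]],
               n: int) -> Tuple[List[float], int]:
    """Process examples online with theta = n; return (weights, mistakes)."""
    w = [1.0] * n  # the "all ones" initial vector
    mistakes = 0
    for x, label in examples:
        if predict(w, x, theta=n) != label:
            mistakes += 1
            winnow_update(w, x, label)
    return w, mistakes


def with_negations(x: Sequence[int]) -> List[int]:
    """Map x in {0,1}^n to (x, 1 - x) in {0,1}^(2n) (hypothetical helper)."""
    return list(x) + [1 - xi for xi in x]


# Toy run: target x_0 OR x_2 over n = 4 features, cycling through all inputs.
n, r = 4, 2
points = list(itertools.product([0, 1], repeat=n))
stream = [(x, 1 if (x[0] or x[2]) else 0) for x in points] * 5
weights, mistakes = run_winnow(stream, n)
print(mistakes, "mistakes; bound:", 2 + 3 * r * (1 + math.log2(n)))
```

For a non-monotone target such as $x_0 \vee \bar{x}_1$, one would feed `with_negations(x)` to the learner and run it with $2n$ features (and threshold $2n$); the target is then a monotone disjunction over the expanded variables.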

The Mistake Bound Analysis 

We can show the following guarantee: 

Theorem 1 The Winnow Algorithm learns the class of monotone disjunctions in the Mistake
Bound model, making at most $2 + 3r(1 + \log n)$ mistakes when the target concept is an
OR of $r$ variables.

Proof: Let $X_r = \{x_{i_1}, x_{i_2}, \ldots, x_{i_r}\}$ be the $r$ relevant variables in our target
concept. Let $W_r = \{w_{i_1}, w_{i_2}, \ldots, w_{i_r}\}$ be the weights of the relevant variables.
Let $w(t)$ denote the value of weight $w$ at time $t$ and let $TW(t)$ be the Total Weight of the
linear threshold function (including both relevant and irrelevant variables) at time $t$.

We will first bound the number of mistakes that will be made on positive examples. Note 
first that any mistake made on a positive example must double at least one of the weights in 
the target function (the relevant weights). So if at time t we misclassify a positive example 
we have: 


$\exists w \in W_r$ such that $w(t + 1) = 2 w(t)$    (1)

Moreover a mistake made on a negative example will not halve any of the relevant weights, 
by definition of a disjunction; so for all times t, we have: 

$\forall w \in W_r, \; w(t + 1) \geq w(t)$    (2)

Moreover, each of these weights can be doubled at most $1 + \log n$ times, since only weights
that are less than $n$ can ever be doubled: a misclassified positive example has $w \cdot x < n$,
so every weight it doubles is below $n$. Combining this with (1) and (2), the number $M_+$ of
mistakes that Winnow makes on positive examples satisfies $M_+ \leq r(1 + \log n)$.
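The count of doublings can be made explicit. The short derivation below is a sketch that assumes logarithms are base 2 (the notes do not state the base) and writes $k$ for the number of times a fixed relevant weight has been doubled so far, a symbol introduced here for illustration:

```latex
\begin{align*}
w(0) = 1 \ \text{and (2)}
  &\;\Longrightarrow\; w = 2^{k} \ \text{after } k \text{ doublings},\\
\text{doubling number } k+1 \text{ happens only if } 2^{k} < n
  &\;\Longrightarrow\; k < \log_2 n,\\
\text{hence the total number of doublings is at most } \lceil \log_2 n \rceil
  &\;\le\; 1 + \log_2 n .
\end{align*}
```

Since every mistake on a positive example doubles at least one of the $r$ relevant weights, summing over the relevant weights gives the bound on positive mistakes.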

We now bound the number of mistakes made on negative examples. Note first that a mistake
made on a positive example increases the total weight by at most $n$. To see this, assume that
we made a mistake on the positive example $x$ at time $t$. We must have:

$w_1(t)x_1 + \ldots + w_n(t)x_n < n$.

Since

$TW(t + 1) = TW(t) + (w_1(t)x_1 + \ldots + w_n(t)x_n)$,

we get

$TW(t + 1) < TW(t) + n$.    (3)

Similarly, we can show that each mistake made on a negative example decreases the total
weight by at least $n/2$. To see this, assume that we made a mistake on the negative example
$x$ at time $t$. We must have:

$w_1(t)x_1 + \ldots + w_n(t)x_n \geq n$.

Since

$TW(t + 1) = TW(t) - (w_1(t)x_1 + \ldots + w_n(t)x_n)/2$,

we get

$TW(t + 1) \leq TW(t) - n/2$.    (4)

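Finally, the total weight always stays positive, because every weight starts positive and is
only ever doubled or halved, i.e., at all times:

$TW(t) > 0$    (5)

Combining equations (3), (4), and (5), and writing $M_-$ for the number of mistakes on negative
examples, we get:

$0 < TW(t) \leq TW(0) + n M_+ - (n/2) M_-$    (6)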

The total weight summed over all the variables is initially $n$, since every weight starts at
1, i.e., $TW(0) = n$. Substituting into (6) and dividing by $n$ gives $0 < 1 + M_+ - M_-/2$,
so $M_- < 2 + 2M_+$. Combining the mistakes on negative and positive examples, we get that
Winnow makes at most $M_+ + M_- < 2 + 3M_+ \leq 2 + 3r(1 + \log n)$ mistakes when the target
concept is an OR of $r$ variables. ■
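The total-weight argument can be checked empirically. The sketch below runs Winnow on random 0/1 examples and asserts inequalities (3), (4), and (5) at every mistake. It assumes threshold $n$ with ties predicted positive; the function name, the random example distribution, and the parameter values are illustrative choices, not part of the notes.

```python
import random
from typing import List, Sequence, Tuple


def check_total_weight_invariants(n: int, relevant: Sequence[int],
                                  num_examples: int,
                                  seed: int = 0) -> Tuple[int, int]:
    """Run Winnow on random examples, asserting (3), (4), (5) at each mistake.

    The target is the OR of the variables indexed by `relevant`.
    Returns (m_pos, m_neg): mistakes on positive and on negative examples.
    """
    rng = random.Random(seed)
    w: List[float] = [1.0] * n
    m_pos = m_neg = 0
    for _ in range(num_examples):
        x = [rng.randint(0, 1) for _ in range(n)]
        label = 1 if any(x[i] for i in relevant) else 0
        dot = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if dot >= n else 0
        if pred == label:
            continue  # no mistake, so no update
        tw_before = sum(w)
        factor = 2.0 if label == 1 else 0.5
        w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
        tw_after = sum(w)
        if label == 1:
            m_pos += 1
            assert tw_after < tw_before + n        # inequality (3)
        else:
            m_neg += 1
            assert tw_after <= tw_before - n / 2   # inequality (4)
        assert tw_after > 0                        # inequality (5)
    return m_pos, m_neg


m_pos, m_neg = check_total_weight_invariants(n=16, relevant=[3, 7],
                                             num_examples=2000)
print("positive mistakes:", m_pos, "negative mistakes:", m_neg)
```

With $n = 16$ and $r = 2$ the theorem predicts at most $2 + 3 \cdot 2 \cdot (1 + 4) = 32$ mistakes in total, whatever the example sequence.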

Note: One can easily show that Winnow makes $\Omega(r \log n)$ mistakes in the worst case.
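One way to see such a lower bound (not necessarily the argument intended in the notes) is to show Winnow, for each relevant variable in turn, the example in which only that variable is 1, repeating it until Winnow predicts positive. The sketch below assumes threshold $n$ with ties predicted positive; `forced_mistakes` is a hypothetical name. Because each example has a single active feature, the score is just that feature's weight, so the simulation only needs to track that weight.

```python
def forced_mistakes(n: int, r: int) -> int:
    """Count Winnow's mistakes on a stream built to force repeated doubling.

    The target is x_0 OR ... OR x_{r-1}. For each relevant index i, the
    stream repeats the positive example with only x_i = 1 until Winnow
    classifies it correctly.
    """
    w = [1.0] * n
    mistakes = 0
    for i in range(r):
        # On this example w . x equals w[i], so Winnow predicts negative
        # (a mistake) for as long as w[i] < n.
        while w[i] < n:
            w[i] *= 2.0  # the update doubles only the one active weight
            mistakes += 1
    return mistakes


print(forced_mistakes(16, 3))  # 12, i.e. r * log2(n)
```

Each relevant weight must climb from 1 to at least $n$ by doubling, which costs $\lceil \log_2 n \rceil$ mistakes per relevant variable on this stream.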

Interesting Open Question: Is there a computationally efficient algorithm for learning
decision lists in the mistake bound model with a mistake bound of $\mathrm{poly}(L, \log n)$, where
$L$ is the length of the target decision list? Note that the so-called halving algorithm
achieves this bound, but it is not a computationally efficient algorithm.
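To illustrate why the halving algorithm is not efficient, here is a minimal sketch of it on a toy class. It keeps every hypothesis consistent with the data so far, predicts by majority vote, and therefore makes at most $\log_2 |C|$ mistakes, but it stores the whole class explicitly. As a stand-in for decision lists, the example enumerates all $2^n$ monotone disjunctions over a small $n$; the function names are hypothetical.

```python
import itertools
from typing import Callable, Iterable, List, Sequence, Tuple

Hypothesis = Callable[[Sequence[int]], int]


def halving_mistakes(hypotheses: List[Hypothesis],
                     examples: Iterable[Tuple[Sequence[int], int]]) -> int:
    """Run the halving algorithm and return the number of mistakes.

    `hypotheses` must contain the target that labels `examples`.
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes = sum(h(x) for h in version_space)
        pred = 1 if 2 * votes >= len(version_space) else 0  # majority vote
        if pred != label:
            mistakes += 1
        # Keep only the hypotheses that agree with the revealed label.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes


def monotone_disjunctions(n: int) -> List[Hypothesis]:
    """All 2^n monotone disjunctions over n variables (toy class)."""
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n), k) for k in range(n + 1))
    return [lambda x, s=s: 1 if any(x[i] for i in s) else 0 for s in subsets]


n = 4
points = list(itertools.product([0, 1], repeat=n))
labeled = [(x, 1 if (x[0] or x[2]) else 0) for x in points]
print(halving_mistakes(monotone_disjunctions(n), labeled), "mistakes")
```

Every mistake removes at least half of the remaining hypotheses, which gives the $\log_2 |C|$ bound; the cost is that each prediction takes time proportional to the size of the version space, which is exponential for classes such as disjunctions or decision lists.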
Winnow versus Perceptron: One can generalize the basic analysis we did for Winnow
to the case of learning linear separators; the guarantee depends on the $L_1, L_\infty$ margin of
the target. In particular, if the target vector $w^*$ is a linear separator such that
$w^* \cdot x > c$ on positives and $w^* \cdot x < c - \alpha$ on negatives, then the mistake bound of
Winnow is

$O((L_1(w^*) L_\infty(X)/\alpha)^2 \log n)$.

The quantity $\gamma = \alpha/[L_1(w^*) L_\infty(X)]$ is called the “$L_1, L_\infty$” margin of the
separator, and our bound is $O((1/\gamma^2) \cdot \log n)$. On the other hand, the Perceptron
algorithm has a mistake bound of $O(1/\gamma^2)$ where $\gamma = \alpha/[L_2(w^*) L_2(X)]$ (this is
called the “$L_2, L_2$” margin of the separator).

Intuitively, if $n$ is large but most features are irrelevant (i.e., the target is sparse but
examples are dense), then Winnow is better because adding irrelevant features increases
$L_2(X)$ but not $L_\infty(X)$. On the other hand, if the target is dense and examples are sparse,
then Perceptron is better.
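A quick numeric comparison makes the two regimes concrete. The sketch below evaluates the two bound expressions with the constants hidden by $O(\cdot)$ dropped. It assumes a base-2 logarithm, takes $\alpha = 1$ for illustration, and reads $L_p(X)$ as the largest $L_p$ norm of any example; the function names are hypothetical.

```python
import math


def winnow_bound(l1_w: float, linf_x: float, alpha: float, n: int) -> float:
    """(L1(w*) * Linf(X) / alpha)^2 * log2(n), with O(.) constants dropped."""
    return (l1_w * linf_x / alpha) ** 2 * math.log2(n)


def perceptron_bound(l2_w: float, l2_x: float, alpha: float) -> float:
    """(L2(w*) * L2(X) / alpha)^2, with O(.) constants dropped."""
    return (l2_w * l2_x / alpha) ** 2


n, r, alpha = 1024, 4, 1.0

# Sparse target, dense examples: w* is the 0/1 indicator of r relevant
# features, and an example may have all n coordinates equal to 1.
print(winnow_bound(l1_w=r, linf_x=1.0, alpha=alpha, n=n))         # 160.0
print(perceptron_bound(l2_w=math.sqrt(r), l2_x=math.sqrt(n),
                       alpha=alpha))                              # 4096.0

# Dense target, sparse examples: w* is all ones, and each example has a
# single coordinate equal to 1.
print(winnow_bound(l1_w=n, linf_x=1.0, alpha=alpha, n=n))
print(perceptron_bound(l2_w=math.sqrt(n), l2_x=1.0, alpha=alpha))
```

In the first setting the Winnow expression grows like $r^2 \log n$ while the Perceptron expression grows like $r n$; in the second the ordering reverses, roughly $n^2 \log n$ against $n$.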





